Developing a predictive model for an emerging epidemic on cassava in sub-Saharan Africa

The agricultural productivity of smallholder farmers in sub-Saharan Africa (SSA) is severely constrained by pests and pathogens, impacting economic stability and food security. An epidemic of cassava brown streak disease, causing significant yield loss, is spreading rapidly from Uganda into surrounding countries. Based on sparse surveillance data, the epidemic front is reported to be as far west as central DRC, the world’s highest per capita consumer, and as far south as Zambia. Future spread threatens production in West Africa including Nigeria, the world’s largest producer of cassava. Using innovative methods we develop, parameterise and validate a landscape-scale, stochastic epidemic model capturing the spread of the disease throughout Uganda. The model incorporates real-world management interventions and can be readily extended to make predictions for all 32 major cassava producing countries of SSA, with relevant data, and lays the foundations for a tool capable of informing policy decisions at a national and regional scale.


S1.1 Model development
In order to develop a parameterised spatial model of the cassava brown streak disease epidemic, we began by considering a range of model structures. The models we considered included different dispersal kernels, the incorporation of data for vector abundance, and the incorporation of survey bias. For each model structure, we carried out preliminary model parameterisation via ABC rejection using a subset of the Ugandan CBSD surveillance data (d f it real ) (see also Parameter estimation in Main Text Methods) and preliminary validation against the remaining surveillance data (d val real ) (Model validation in Main Text Methods), which allowed us to quantify the predictive power of each model. In this context, the term preliminary simply means the number of fitting and validation simulations were lower than for the final model presented in the Main Text. These analyses led to the rejection of the use of an exponential dispersal kernel and the selection of the model structure showing the best correspondence to the validation surveillance data.
All models shared a set of foundational features. Firstly, as cassava is a vegetatively propagated crop, all models comprised an SI compartmental structure as infection persists from harvest to the subsequent planting [1]. Secondly, all models included 1 the cassava host landscape (Incorporating data-driven model layers in Main Text Methods). Parsimony guided our process of model development to find the simplest epidemiologically-informed model structure that can reproduce and predict the dynamics of the CBSD epidemic. Supplementary Figure 1 gives an overview of the parameterisation and validation methodology as applied during model development and for the final selected model.

S1.1.1 Estimating false negative survey detection rate
The surveillance protocol involved sampling 30 plants from only the dominant cassava variety in a field (CBSD surveillance data in Main Text Methods). For a subset of the Ugandan surveys, surveyors recorded whether CBSD symptoms were present anywhere in the field, not just from the 30 plant sample or the dominant cultivar. Therefore, we can use this subset of surveys to approximate the level of disease underestimation in the field. Supplementary Figure 2 shows the variability in this underestimation rate across the 5 years with an approximate mean of 0.15, which was taken as the simulated surveillance false negative rate.

S1.1.2 Kernel selection
During exploratory stages of model development, two candidate kernel functions were considered: an exponential (Equation S1) and a power law (Equation S2): An exponential kernel produces higher rates of local spread and lower rates of longer range spread. In contrast, a power law kernel has slightly higher rates of long range spread and lower rates of local spread. The CBSD spatiotemporal pattern observed in the surveillance data from Uganda indicated significant long range spread combined with lower rates of local bulk up [2]. Preliminary simulations confirmed that an exponential kernel showed poor correspondence with the data (results not presented) compared with a power law kernel. Are the region(s) of highest posterior density well sampled?
Supplementary Figure 1: Overview of the parameterisation and validation methodology as applied during model development. Candidate models are proposed and preliminary fitting and validation is carried out to assess the ability of the model to reproduce the dynamics of d f it real and predict the dynamics of d val real . For the candidate model that performs best in comparison with the other candidates during preliminary validation, additional batches of simulations are run covering the broad region of highest posterior density to improve the sampling density. The summary statistic assessment is then carried out to identify appropriate summary statistic tolerances for the fitting criteria, given the density of sampling from the prior.

S1.1.3 Candidate models
We iteratively developed and parameterised three candidate models based on the selection of the power law kernel, which we refer to as model 0, model 1 and model 2.
The simplest model comprised the host landscape and the SI structure with a power law kernel. The two subsequent models built upon this foundation. We assessed the performance of the initial model (model 0), then iteratively added additional features. Supplementary Table 1 summarises the features included in each model iteration.
The two additional features included in model iterations model 1 and model 2 were the allowance for a false negative survey detection rate and the spatial vector abundance layer. The false negative survey detection rate refers to the modification of the simulated surveillance scheme to allow for false negatives in cases, such as surveyors not recording disease when present on varieties other than on the dominant variety in a field (cf. CBSD surveillance data in Main Text Methods and Supplementary Methods S1.1.1). The spatial vector abundance layer introduced in model 2 represents the relative abundance of the CBSI vector, B. tabaci, at a 5 km resolution. The vector layer allowed the rate at which infection spreads to, from, and within a given location to be modulated by the local abundance of the vector (Figure 1 and see also Model structure in Main Text Methods).
For each model, we carried out preliminary ABC parameter estimation and model validation. In contrast to the method described for the final model in the Main Text, during model development the number of fitting simulations was limited to between 20,000 to 30,000 and the number of validation simulations to 1,000. The three parameters to be estimated were the proportion of dispersal that remains within the source cell, p, the kernel scale parameter, α, and the transmission rate, β. Supplementary Figure 3 shows the posterior distributions obtained from preliminary parameter estimation for each of the three models. The tolerances applied to the three summary statistics were ϵ f it inf = 0.3, ϵ f it grid = 0.58, which we define as the refers to the proportion of simulations that, on a per simulation basis, pass within both the fitting and validation tolerances, divided by the proportion that pass within just the fitting tolerances.
preliminary fitting criteria.
Supplementary Figure 4 shows the behaviour of each model from an ensemble of 1, 000 validation simulations through the lens of the two infectious proportion-based metrics. All three models appear capable of reproducing the infectious proportion dynamics in the central area surrounding Kampala. However, there is a clear improvement in performance for the Ugandan national infectious proportion metric from model 0 to model 2. In order to quantify the performance of the preliminary validation simulations, the same methodology as described in the Main Text is used, but with different tolerances. Specifically, for each 1,000 validation simulations for each model, we calculate the proportion of simulations that meet the preliminary fitting criteria. For this subset of simulations, we calculate the proportion that sufficiently correspond to the validation dataset, d val real , which we define as ϵ val inf = 0.3. Supplementary Table 2 summarises these results. The strong preliminary results of model 2 justified our selection of this model structure for a final round of more intensive ABC parameter estimation.

S1.1.4 Estimating the clean seed replacement parameter
The parameterised version of model 2 is used to estimate the value of the clean seed replacement proportion parameter,   Table 3 summarises the proportion of simulations for each value of r clean that pass the fitting and validation criteria with the tighter tolerances and indicates that a value of 0.15 for r clean best captures the dynamics of the clean seed programme (Supplementary Figure 5).

S1.2.1 Efficient sampling from the prior
For the process of ABC rejection applied to computationally intensive models, a major challenge is to run enough simulations to explore the prior sufficiently with finite computational resources [3]. We used a high performance computing (HPC) cluster, which allowed us to parallelise the fitting and validation simulations. Nonetheless, the total compute time for the 233,600 fitting simulations was 87,157 hours, with a mean simulation time of 23 minutes. Therefore, in order to make efficient use of computational resources, for parameters α and β, we sampled from a joint uniform prior and carried out multiple batches of simulations, updating the search space at  Figure 7). For the proportion of dispersed inoculum that remains in the source cell, p, the search space remained the full uniform prior between 0 and 1. For all parameters, we sampled continuously within parameter space without discretisation.
Carrying out sequential batches in this way is a more efficient use of resources, allowing a wide initial exploration of parameter space whilst spending the majority of computational resource on regions of highest likelihood density (i.e. more relevant regions). Initially sampling from the full search space defined by the prior (see batch 0 of Supplementary Figure 7) ensures that we identify the optimal region(s) before restricting the search space to regions of higher likelihood density. Moreover, this allows us to apply the method that uses the summary statistics to recover known parameters using synthetic data, to ensure that the fitting techniques can successfully discern between parameters from different regions of parameter space (Supplementary Methods S1.3). Figure 7: The density of sampling of the kernel exponent, α, and the log of the transmission rate, ln(β), during parameter estimation. A uniform prior was used for the third estimated parameter, p, being the proportion of dispersed inoculum that remains in the source cell. Simulations were carried out in batches, iteratively exploring and updating the parameter search space for each subsequent batch, with the aim of converging towards and densely sampling the region with highest posterior probability density. The total number of simulations in batches 0 to 3 respectively are: 22,603, 71,183, 59,878, 79,936.

S1.2.2 Calculating the posterior distribution
For each of the estimated parameters, we discretise parameter space into cuboids of the following dimensions: α: 0.02, ln(β): 0.1, p: 0.02. The approximate likelihood is calculated as the number of accepted samples within each cuboid divided by the total number of samples within the cuboid. Note that the number of samples per cuboid affects the width of the confidence interval obtained for likelihood values, not the likelihood itself. We choose to average samples over cuboids, rather that taking individual samples as it allows pooling of many samples to get a smoother posterior distribution, minimising noise in the final output.
As the prior is uniform, the approximate likelihood is taken as the posterior. When simulating using the posterior, we sample from the distribution of cuboids, first selecting a cuboid according to it's posterior value and then selecting parameter values by sampling uniformly within the selected cuboid.

S1.3 Summary statistic assessment using synthetic data
The goal of estimating parameters using the Ugandan CBSD surveillance data is to calibrate the model to the dynamics of the spatio-temporal spread observed in the data. Therefore, the process of summary statistic selection is to identify statistics with the ability to capture the landscape-scale spatiotemporal spread behaviour of the CBSD epidemic. However, the optimal method for selecting summary statistics and tolerances remains an open question in the likelihood-free inference literature [4][5][6]. For high dimensional data, it may not be computationally feasible to use summary statistics that meet the statistical definition of sufficiency and it is common practice to select summary statistics based on their ability to capture an important property of the system dynamics, which are described as informative summary statistics. [7].
In the Main Text Methods we describe three informative statistics used for ABC parameter estimation. These statistics have been designed to discriminate between epidemics with different landscape-scale dynamic behaviour. Here, we outline the methods used to assess the suitability of a given statistic.

S1.3.1 Methods
The process of summary statistic assessment begins by selecting model parameters from the prior distribution and running model 2 using these known parameters to generate synthetic epidemic surveillance data, d syn . The location and timing of simulated surveillance for the synthetic data is retained from the real-world Ugandan surveillance data.
We then perform the ABC parameter estimation procedure (cf. Parameter estimation in Main Text Methods) to derive a posterior distribution for the parameters using a given statistic, S, or combination of statistics. Specifically, for each of the generated d syn data sets, ABC parameter estimation is carried out multiple times, first using each of the summary statistics, S cen , S nat , S grid independently, then using all three statistics in combination. For a given statistic or statistics in combination, as the tolerance(s) is reduced, ϵ → 0, we assess the behaviour of the statistic to discern between different regions of parameter space and converge towards the known input parameters.  To test whether a statistic is robust to the stochasticity of the model, the statistic evaluation procedure is repeated to generate 10 distinct sets of synthetic surveillance data for a given set of input parameters. By fitting to synthetic data, the selection procedure emulates the process of estimating parameters on the real-world data, but when the true parameters are known.
The entire process was repeated for three different parameter sets to generate synthetic data with different spread behaviour. The input parameter value sets are given in Supplementary Table 4. The proposed summary statistics are used independently and in combination to attempt to recover the known parameter values, hence assessing the robustness of the summary statistics to a diversity of epidemics.

S1.3.2 Results and discussion
For all of the presented results, we have selected representative figures from the 10 replicate simulations per parameter set. Supplementary Figure 8 shows the outputs of the simulated surveillance scheme from a representative synthetic simulation for each input parameter set. Importantly, each parameter set results in epidemics with visually distinct spread patterns.
Supplementary Figure 9 to Supplementary Figure 11 represent pairwise plots of the posterior distributions based on α and ln(β), derived by carrying out ABC parameter estimation using each of the summary statistics in isolation with a given d syn dataset as generated by the corresponding known parameters. The figures contrast posterior distributions derived using increasingly smaller tolerances. As the tolerance are reduced, the posterior distributions consistently converge towards the simulation parameters.  Table 4, highlighting the fundamental differences in the spatial structure of the different epidemics as the input parameters are varied. Red crosses indicated CBSD detection at the field level, and green crosses for a report of CBSD absence.
Supplementary Figure 12 illustrates the equivalent process using all three summary statistics in combination for ABC parameter estimation, which appears to perform especially well. Similarly, Supplementary Figure 13 and Supplementary Figure 14 show pairwise plots of the remaining parameter combinations when summary statistics are used in combination and also show a general trend of convergence to the known parameter values. The strong performance of the three statistics in combination motivated their selection for ABC parameter estimation with the Ugandan surveillance data.
It is important to note that a number of the pairwise posterior plots appear to be sparsely populated as the tolerances are reduced to very small values. This is due to non-uniform sampling density, wherein the majority of the fitting simulations are concentrated in the region of parameter space containing the highest posterior density for the real-world Ugandan survey data (Main Text Figure 3). Nonetheless, for each of the input parameter sets, the sampling intensity appears sufficient to indicate convergence towards the known parameter values.